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Many physical and chemical processes, such as folding of biopolymers, are best described as dy- 
namics on large combinatorial energy landscapes. A concise approximate description of the dynamics 
is obtained by partitioning the micro-states of the landscape into macro-states. Since most land- 
scapes of interest are not tractable analytically, the probabilities of transitions between macro-states 
need to be extracted numerically from the microscopic ones, typically by full enumeration of the 
state space or approximations using the Arrhenius law. Here we propose to approximate transition 
probabilities by a Markov chain Monte-Carlo method. For landscapes of the number partitioning 
problem and an RNA switch molecule we show that the method allows for accurate probability 
estimates with significantly reduced computational cost. 

PACS numbers: 05.10.Ln,87.15.H-,02.50.Ga 
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I. INTRODUCTION 

Energy landscapes [U-Q are a key concept for the de- 
scription of complex physical and biological systems. In 
particular, the dynamics of structure formation ("fold- 
ing") of biopolymers, e.g. protein or ribonucleic acids, 
can be understood in terms of their energy landscapes 
d, Formally, a landscape is determined by a set 
X of micro-states (or conformations), a neighborhood 
structure of X that encodes which conformations can 
be reached from which other ones, and an energy func- 
tion i? : X — 7> R which assigns an energy value to each 
state. In the case of ribonucleic acids (RNA) it has 
been demonstrated that the dynamics of the folding pro- 
cess can be captured in good approximation by merging 
large contiguous sets of micro-states into macro-states 
[i,l3|. A typical mapping is in terms of gradient basins: 
Each macro-state contains the micro-states from which a 
given local minimum is reached by steepest descent in en- 
ergy, including the local minimum itself. The so-defined 
macro-states are also called inherent structures in the 
context of continuous disordered systems, see ref. Q for 
a recent review. 

Given a partitioning of the landscape, the dynamics 
is approximately described as a Markov chain on the 
set of macro-states. In order to obtain this description, 
the transition probabilities between macro-states in this 
Markov chain need to be extracted from the original en- 
ergy landscape. 

As a first approximation, the Arrhenius equation pre- 
dicts that the transition probability is exponentially sup- 
pressed by the ratio between barrier height and temper- 
ature. The barrier height (also called activation energy) 
from minimum a to minimum b measures the minimal 
amount by which the system's energy must increase along 
a path from a to 6 [ad, [13] ■ The accuracy of this ap- 
proach is limited because it ignores the multiplicity of 



low energy paths [ll|. A more severe drawback is the 
complexity of computing barrier hei ght itself. For land- 
scapes of RNA secondary structure [l^l , the problem is 
NP-hard [ll,[il. 

Commonly used methods [1, [Tsl - fist for precise tran- 
sition rate estimation are based on enumeration of all 
micro-states. For landscapes of real combinatorial prob- 
lems or long biopolymers with billions of micro-states, 
however, enumeration is impractical with the given time 
resources. Typically, limited storage capacity puts even 
more severe restrictions on the size of tractable prob- 
lems because a large fraction of the enumerated micro- 
states needs to be kept in working memory. Some stud- 
ies partially circumvent this problem by considering only 
the low-energy fraction of the landscape that is tractable 
with the available resources [3, IT9l - l2l| . Other heuristic 
approaches [2^ - [26j restrict the landscape to the subset 
of states likely to be traversed by certain trajectories, 
e.g. folding from the open chain to the ground state of a 
biopolymer. 

Here we make a contribution to the original challenge 
of capturing an arbitrary discrete landscape in terms of 
macro-states and transition probabilities. We suggest a 
Markov chain Monte-Carlo sampling method for transi- 
tion matrix estimation. At difference with the earlier 
approaches, the memory requirement scales linearly with 
the number of non-zero transition probabilities to be de- 
termined. Other recent methods of stochastic landscape 
exploration [g^, HI] use trajectories of the original dy- 
namics for counting transitions between macro-states. In 
contrast, the idea behind the present method is to explic- 
itly explore boundaries between macro-states. To this 
end, we confine the dynamics into a single macro-state 
h and find and count possible transitions from 6 to all 
adjacent macro-states. This strategy allows to select the 
regions of the landscape to be explored and the precision 
to be applied. 
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II. LANDSCAPE AND MICRO-STATE 
DYNAMICS 

A discrete energy landscape is a triple {X, E, M) where 
• X is a. finite set of states, 



• E : X 



is an energy function on X , and 



• M : X — > V{X) is a neighborhood function or 
"move set" that assigns to each state x € X the set 
of its directly accessible neighboring states. ViX) 
is the power set of X. Here we assume that M 
is symmetric, i.e. x € M{y) =^ y €: M{x). By 
A we denote the maximum number of neighbors, 
A = max^gx \M{x)\. 

We consider a time-discrete stochastic dynamics on the 
state set X. Having the Markov property, the dynamics 
is defined by giving the transition probability Px^y from 
each X G X to each y S M{x). Provided the system is in 
state X at time t, px-iy is the probability that the system 
is in state y at time < + 1. With probability Px^x ~ 
1 — J2yeM{x)Px^V' system remains at state x. 

Specifically, the Metropolis probabilities at inverse 
temperature /3, 



Px 



A-i min{exp(/3[£;(x) - Eiy)]), 1} (1) 



are used throughout this contribution. This choice, 
however, is not compulsory. All that follows, and in 
particular the estimation by sampling, applies to arbi- 
trary choices of transition probabilities leading to ergodic 
Markov chains. The ergodicity is important because we 
need a unique stationary distribution P{x) on X. 



III. PARTITIONING AND MACRO-STATE 
DYNAMICS 

A partitioning of the landscape is a mapping F from 
the set of micro-states X into a set of macro-states B. 
Our goal here is to find a dynamics on B that does 
have the Markov property while following the original 
micro-state dynamics as closely as possible. In general, 
however, a Markov chain is not obtained as the direct 
mapping {F{xt))'^Q of a Markov chain {xt)'^Q generated 
by the dynamics on X. The reason can be sketched as 
follows. When the system is in a macro-state b € B, 
the probability of exiting to a macro-state c depends on 
where exactly (in which micro-state) the system is inside 
b. The micro-state assumed inside b, however, depends 
on how the system entered b, which is again influenced 
by the macro-state a assumed before entering b. 

Thus, the following simplifying assumption is made 
[2^. Given that the system is found in macro-state 
b € B, the micro-state x ^ X is distributed as 

pr^.(Pi^)/j:yeF-^^t)Piy) ifxeF-i(6) 
\ otherwise. 



This is the stationary distribution P of the whole system 
restricted to micro-states in b and normalized appropri- 
ately. Under this assumption, the probability of a tran- 
sition to macro-state c, when being in macro-state b ^ c 
is 



96- 



(3) 



yeM{x}nF-^{c) 



The inner sum is the probability of going to a micro-state 
y belonging to macro-state c and being a neighbor of x, 
given that the system is in state x. The outer sum repre- 
sents the equilibrium weighting of the micro-states x in- 
side the given macro-state b. A straight-forward method 
determines the exact transition probabilities by perform- 
ing the sums in Eq. ([3]), i.e. exhaustive enumeration of 
all micro-states and all neighbors • 

Throughout this contribution, we consider the usual 
partitioning of X with respect to gradient basins but the 
method is not restricted to this choice. Two micro-states 
x,y € X lie in the same macro-state F{x) = F{y) if and 
only if the steepest descent walks starting in x and y 
terminate in the same local minimum. A state u £ X is 
called local minimum, if E{v) > E{u) for all v £ M{u). 
For a given landscape and partitioning, the macro-state 
transition probabilities can be estimated by the sampling 
algorithm presented in the next section. 



IV. SAMPLING METHOD 

The method we introduce computes an estimate of the 
transition probabilities q in Eq. ([3]) by a standard impor- 
tance sampling restricted to a macro-state b using the 
micro-state probabilities Pb{x) defined in Eq. Being 
in state Xt S at time i, a neighbor z e M{xt) is 

drawn at random with equal probabilities. The sugges- 
tion is accepted as the next state, Xt+i = z, with proba- 
bility min{l, Pij{z) / Pb{xt)} . Otherwise the state remains 
the same, Xt+i = Xt. This choice guarantees that the 
relative frequency of state x tends towards the relative 
frequency Pb{x) for increasing chain length t ^ 00 [sot . 
For a realization of a Markov chain of length imax, tran- 
sition probabilities arc estimated as 



1 



tmax 

■ E E p-^' 

t=l y£M{xt)r]F-^(c) 



(4) 



In practice, the inner summation is performed only once 
at each time i, because each neighbor y of xt contributes 
to the transition probability to exactly one macro-state 

F{y). 

Computation time is saved by storing visited micro- 
states of basin b and their sets of neighbors with tran- 
sition probabilities in a data structure with fast search 
access, e.g. in a hash table. This is particularly advanta- 
geous in cases with broadly distributed micro-state prob- 
abilities such as Boltzmann weights at low temperature. 
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Here the Markov chain will encounter the highly proba- 
ble (low energy) micro-states many times but neighbor 
sets and transition probabilities are computed only once 
per state. In the usual cases where macro-states are de- 
fined as basins of local minima, memory of visited states 
also saves time in evaluating the macro-state assignment 
function F: When the gradient walk starting at state x 
reaches a micro-state known to be in basin 6, x itself is 
known to belong to b. Thus in many cases the walk may 
be terminated before reaching the ground state. Keep- 
ing previously visited micro-states in memory, however, is 
not necessary for the method to work. It may be handled 
according to the available resources. One may simply 
stop storing micro-states when the designated memory 
has been filled. 

So far wc have described how to estimate probabili- 
ties of transitions from one macro-state b to others. The 
result is the 6-th column vector {q'ij_^^)ceB of the esti- 
mated transition matrix q' as given in Eq. |4l By ap- 
plying the procedure separately to each macro-state, the 
full matrix q' is obtained. This can be implemented as 
an iterative exploration of the energy landscape without 
initial knowledge of the set of macro-states. Whenever a 
neighbor y of a state x in the Markov chain belongs to 
a macro-state F{y) not previously seen, we add the pair 
{F{y),y) to a queue Q of macro-states yet to work on. 
Initially, Q may contain only one particular pair (6, xq), 
e.g. the completely unfolded state xq of a polymer and 
the corresponding macro-state b = F{xo). The itera- 
tive exploration of the landscape is implemented in the 
following loop, (i) Extract a pair (&, xq) from Q; (ii) gen- 
erate Markov chain inside 6, starting at xo', (hi) obtain 
estimates according to Eq. ^ and add newly discovered 
macro-states to Q; (iv) If Q is not empty, resume at (i). 
Note, this method is directly parallelizable and will easily 
profit from distributed computing. Several independent 
realizations of Markov chains with respect to different 
macro-states can be run simultaneously, extracting from 
and feeding to the same queue. An implementation of the 
method is part of the Energy Landscape Library [sij . 



V. NUMBER PARTITIONING LANDSCAPE 

The number partitioning problem (NPP) is a decision 
problem in the theory of computation and computational 
complexity [s^ - ls^ ]. It asks if a given set A oi N real 
non-negative numbers can be partitioned into two subsets 
B, C such that numbers in B have the same sum as those 
in C . In an equivalent formulation, we label the numbers 
in A as ai, . . . , oat and use spin variables xi, . . . ,xn to 
encode if at is in subset B {xi = or in subset C 
{xi = —1). This system has the set of micro-states X = 
{— 1, +1}^. We define the energy of state x ^ X a,s 

N 
i=l 



Then the NPP amounts to the question if the ground 
state energy of this system is zero. 

The number partitioning landscape is obtained by us- 
ing the hypercube as the neighborhood structure. For 
each X G X we have 

M{x)^{y(,X\d{x,y) = l} (6) 

as the set of neighbors. The usual Hamming distance d 
is used, so d{x, y) is the number of entries i such that 
Xi ^ yi. A local move on the landscape means flipping 
one of the N spin variables Xi. 

Random instances are typically generated by drawing 
the Oi as statistically independent random variables uni- 
formly distributed in the unit interval. Then the ex- 
pected number of local minima grows exponentially with 
N, more precisely {\B\) ^ 2^N~^/^ [3a]- Here we use 
special instances of the NPP where 

a. ^{^- 1)-" (7) 

with a — 0.55. For these instances, we have found the 
number of local minima to grow exponentially with N for 
iV < 40. However, the growth is much slower than for 
randomly generated instances. At = 40, the instance 
of Equation ([7]) has 318 local minima, to be compared 
with an expected number of « 10^^ local minima for 
randomly generated instances. Each landscape with 10 < 
A^ < 40 has at least one basin with an energy barrier > 
0.1. We explore the landscape at temperature 1//3 = 0.1. 

Figure [T] shows the convergence of the probability es- 
timates. For each system size A^, the sampling error de- 
creases inversely proportional to the number of sampling 
steps performed per basin. Larger systems need more 
computational effort to reach a certain precision. The in- 
set of Fig. [T] indicates that the total computational effort 
required for the error to fall below a given value grows 
sub-exponentially with A^, to be compared with a number 
of micro-states increasing as 2^. Thus under growing A^, 
sampling a strongly decreasing fraction of micro-states is 
sufficient in order to reach a given precision. 

VI. FOLDING LANDSCAPE OF AN RNA 
SWITCH 

As a real-world example of folding landscapes of 
biopolymers we consider RNA molecules [l^l- The pri- 
mary structure of an RNA molecule is a finite sequence 
(a string) over the alphabet of the four nuclear bases 
{ A , C , G , U} . An RNA secondary structure is a list of pairs 
(i, j) of positions in the primary structure such that the 
following conditions hold. (1) Base combinations at pair- 
ing positions must be A-U or G-C (Watson-Crick pairs) 
or G-U (wobble pair); (2) each position i can pair with 
at most one other position j; (3) there are no two pairs 
(«,j) and (fc,Z) with i < k < j < I. The latter condi- 
tion forbids so-called pseudoknots and makes the graph 
representation of a secondary structure outer-planar (see 
Fig.©. 
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FIG. 1: (Color online) The deviation of estimated transition 
probabilities from the exact values is inversely proportional 
to the number of sampling steps (main panel). Shown are the 
analyzed special instances of number partitioning landscapes 
(see Eq. [T]) for various sizes A*". Error bars (A^ = 15 and 
A'^ — 40) indicate the standard deviation between errors for 
different basins. The inset shows the A''-dependence of the to- 
tal number of sampling steps required for reaching a given pre- 
cision, i.e. lowering the error below r. Given a macro-state a, 
we employ the KuUback-Leibler (KL) divergence -D(.||..) [S^ 
to define the error as e{s,a) := D{q' {s, a)\\q' {2s,a)), making 
a comparison of the estimate of the outgoing transition prob- 
ability vector q'{s,a) = (gl_>6)bes(s) after s sampling steps 
with its estimate after 2s sampling steps. The plotted values 
are the equally weighted average of the errors e{s,a) over all 
macro-states a £ B. 



In the folding landscape of an RNA sequence, the set of 
micro-states X contains the valid secondary structures. 
The energy E{x) of a secondary structure x € X is a 
sum over binding energies of stacks (contiguous regions 
of binding) and entropic contributions from open (un- 
bound) sections of the RNA chain. For details of en- 
ergy calculations, we refer to the literature [1^ Hi], li^ . 
Micro-states x,y G X are adjacent, i.e. y G M{x) and 
X e M{y), if y can be generated from x by adding or 
removing a single base pair. Shift moves |37[ are not 
considered in this contribution. When the lowest energy 
neighbor of a structure is not unique the degeneracy is 
resolved by the lexicographic ordering on string represen- 
tations of the structures [1, 0, H . 

Multistable RNAs, so called RNA-switches, are essen- 
tial for the regulation of cellular processes. Thus, an 
understanding of the folding kinetics of such molecules 
is of high importance. For a detailed overview see j40| . 
Specifically, we work with the bistable RNA d33 sequence 
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FIG. 2: (Color online) RNA secondary structures 
(top/bottom) with energies — 14.4^j^ and — 14.3^g| of the 
tested bistable RNA d33 and their outer-planar linear Feyn- 
man diagrams (middle) (drawn using jViz.Rna vl.77 [381]). 
Energy evaluation and sequence design is based on Vienna 
RNA package vl.8.2 @ and the method from 



shown in Fig. [21 It has 29, 759, 371 micro-states, allowing 
for full enumeration and thus for a comparative analysis 
with our method. Out of the 3, 223 local minima, the 
two lowest are the secondary structures given in Fig. [31 
These two ground states have practically the same en- 
ergy. A walk between the ground states involves break- 
ing all base pairs, resulting in an energy barrier of height 
IS.E ~ 1.18 X 10^^^ J. The temperature for both sam- 
pling and energy calculation is T = (273.15 + 37.00) K. 
Therefore 1//3 = fcsT = 4.28 x lO"^! j is more than one 
order of magnitude below the barrier height /S.E of the 
RNA switch. 

A comparison between exact and sampled transition 
probabilities is made in terms of the average time t(6) 
from macro-state b to one of the ground states. For a 
biopolymer as considered here, r(&) is the folding time 
when starting in an initial state 6 such as the open chain. 

Given a set of target states A C B, the time to target is 
T(a) ~ when starting in one of the target states a & A 
(boundary condition) . For a starting state h £ B\A, the 
average time t(6) until first reaching one of the target 
states obeys the recursion 



t(6) = 1 + X! 1b^cT{c) 

ceB 



(8) 



The average time to target from b is one time step plus 
the time to target from the state c following b. The dis- 
tribution of c is given by the transition probability qt^c- 
Time to target is also called exit time |43| . 

Figure [3] shows that T(6)-values based on the sampled 
transition probabilities have small relative error for all 
starting macro-states b G B. With lO"' sampling steps per 
basin, the ratios between exact and approximate times r 
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FIG. 3: (Color online) Sampling precision in terms of the 
predicted average time t(6) to reach the ground state from 
a macro-state b for RNA d33. For each b, the correspond- 
ing data point gives the ratio between TsampUng based on the 
sampled transitions q' and the value Toxact from the exact 
ones q versus Texact itself. Symbols indicate number of sam- 
pling steps per macro-state as 10'^ (squares), 10* (diamonds), 
and 10^ (crosses). The target set contains both ground states. 
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FIG. 4: (Color online) Time evolution of the sampling error 
(KL divergence Eq. [9]) for the folding landscape of the RNA 
switch molecule d33. Mean, median, and maximum are for 
the distribution of KL values over the \B\ = 3223 macro-states 
(local minima). 



are in the range [0.75; 1.15]. They fall into [0.96; 1.07] 
when using 10^ steps per basin. 

To investigate the sampling error wc compare, sepa- 
rately for each macro-state 6, the exact with the esti- 
mated transition probability vectors for leaving b. We 
quantify the discrepancy between the two vectors by the 
KuUback-Leibler divergence [36| 



c£B 



(9) 



Figure 2] summarizes the evolution of the sampling er- 



FIG. 5: (Color online) Time evolution of macro-state concen- 
trations for RNA d33 from the exact transition probabilities q 
(curves) and from the estimates q' via sampling for 10* (o) 
and 10^ steps per macro-state (-I-). The concentrations both 
for the exact and the sampled transition rates are at the sta- 
tionary values for t > 10^^. See main text for discussion of 
the discrepancy in these stationary coircentrations. 



ror for increasing sampling steps per basin. As in the 
number partitioning landscape, KuUback-Leibler diver- 
gence (KL) decreases inversely proportional to the num- 
ber of sampling steps. The mean of the distribution of 
KL values across basins is larger than the median by a 
factor of 3 indicating a broad distribution. This is due 
to a broad distribution of macro-state sizes. Probability 
vectors for macro-states comprising one or a few micro- 
states reach a low error after fewer sampling steps than 
those for large macro-states. Still also the maximum of 
the error across all basins decreases proportionally to the 
average. One of the extensions of the method outlined 
in Sec. IVIll chooses a number of sampling steps individu- 
ally for each macro-state based on its estimated partition 
function. 

In FigurelSl we compare the kinetics of the molecule for 
the approximated transition probabilities via sampling 
and the exact ones obtained by enumeration of all micro- 
states of the landscape. As an initial condition we choose 
the whole ensemble to be in the macro-state of the open 
chain (structure without base pairs). As a qualitative 
description, the ensemble first populates the first and, 
somewhat later, the second ground state. On an interme- 
diate time scale (10®), an almost constant concentration 
vector is reached with the second ground state dominat- 
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ing the first. However, this plateau concentration vector 
is transient. Probabihty mass flows from the second to 
the first ground state on a slow time scale (10^^) to reach 
the stationary concentrations. 

With transition probabilities obtained by sampling for 
10^ steps per macro-state, the kinetics is reproduced with 
high precision both in the timing as well as the absolute 
concentration in the plateau where the relative error is 
below 10^^. Since we hash the probabilities for already 
visited structures, the computational effort per basin is 
dominated by the number of visited states instead of 
overall sampling steps. Thus, small basins are sampled 
faster than larger ones. When sampling 10^ steps per 
macro-state, which renders the kinetics with even higher 
accuracy, computation time is still reduced by a factor of 
« 9 compared with full enumeration. 

The stationary concentrations found at time t > 10^^, 
however, do not agree with the exact solution. A much 
larger number of sampling steps is required to match 
these. Further tests with other RNA switch molecules 
yield the same qualitative result for moderate number of 
sampling steps per basin. Both the concentration lev- 
els and the time scales are faithfully reproduced by the 
transition rates from sampling, except for the stationary 
concentrations . 

A closer look at the particular structure of the land- 
scape of RNA switches hints at an explanation for the dis- 
crepancy. Both ground states have large and deep basins. 
In RNA d33, barriers to neighboring basins are all at least 
8.6 X 10"^" J above ground state energy, which is more 
than 20 fcsT. Exits from one ground state basin towards 
the other lead through a small number of micro-states. 
When sampling the large ground state basins, these few 
salient micro-states are likely to be missed. The subtle 
balance between incoming and outgoing probability flow 
in the equilibrium is distorted. Due to the symmetry of 
the move set, however, those missed non-zero transition 
probabilities can be identified to some extent. When the 
forward rate is non-zero then the backward rate must be 
non-zero as well. For dynamics with detailed balance, as 
considered here, even quantitative correction of missed 
or undersampled rates is possible. This is suggested as 
one of the extensions in the following section. 



VII. EXTENSIONS AND MODIFICATIONS OF 
THE METHOD 

We outline ideas for varying the method to potentially 
increase efficiency and applicability in various settings. 
These are not used in the applications in Sec.lVland lVIl 

A. Guided sampling 



a subset of the set of macro-states [18[. Then Q may 
be handled as a priority queue. For instance, we may be 
interested only in transitions between macro-states below 
a certain energy threshold or those involved in typical 
trajectories. In the latter case, the next macro-state to be 
explored is the one that is reached from already explored 
macro-states with the largest probability. 

B. Partition function estimation 

In addition to transition probabilities, the canonical 
partition function 

Zi,= exp(-/3S(x)) (10) 

xe_F-i(h) 

of the macro-state b may be estimated at any time dur- 
ing the sampling. Consider a subset X C F~^{b) of 
the micro-states of macro-state b. When sampling in 
the macro-state b, the fraction of time the Markov chain 
spends in X is 

r=YT. cxp(-/3i?(x)) . (11) 

Therefore Zb can be calculated when knowing r and the 
energies of all states in X. Now X can be taken as the set 
of states visited in the first t* steps of the Markov chain, 
X = {xt I t = 1, . . . ,t*}. An estimate r' of r is obtained 
by counting how often the Markov chain visits states in 
X during a sufficiently long time interval [istart, ^stopi; 

^, ^ |{ t I ^start < t < tstop A Xt e X}\ 
t-stop ^start 

In order to obtain an unbiased estimate of r, this time in- 
terval must not overlap with the time steps during which 
X is recorded; thus istart > t* . By solving Eq. (fTTj) for 
Zb and replacing r with the estimate r', we obtain 

Z'b = -Y. eM-m^)) (13) 
as an unbiased estimate of the partition function Z^- 

C. Sampling time adjustment 

The estimate r{t) may also be used for adapting the 
length of the Markov chain exploring macro-state b to 
the size of b. The sampling will be run until the fraction 
of covered probability mass exceeds a certain threshold, 
e.g. stopping as soon as r{t) > 0.5. 

D. Landscape coarse-graining 



The stopping criterion (iv) of the outer loop for full 
transition matrix estimation (Sec. IIV[) may be modified 
if we do not aim to explore the whole landscape but only 



The state space can be coarse-grained beyond the ini- 
tially chosen macro-state state partitioning by dynami- 
cally merging macro-states |44| . For merging macro-state 
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b into macro-state a, the affected entries in the matrix q 
are replaced by 

ZaQa^c + Zbqt^c .N 
qa^c = 7,7 (14) 

qc^a = Qc^a + Qc^b (15) 

for all macro-states c ^ {a, fo}. The new diagonal element 
<ia~^a is obtained by normalization of probability. The 
row and column of macro-state b are set zero (or deleted) . 
In a separate index, the mapping of macro-state b to 
macro-state a is stored. 

Strategies for the choice of macro-states to be merged 
need to be explored yet. A reasonable starting point 
is to choose pairs of macro-states with high overlap in 
successor states, e.g. choosing a and b such that 

Qa^cQb^c (16) 

cG-B 

is maximal. 



E. Balancing the transition probability matrix 

If the micro-state dynamics in terms of the transition 
probabilities Px-^y fulfills detailed balance, then so does 
the macro-state dynamics with transition probabilities 
Qb-^c- Detailed balance means 

Zi qi^s = Zs qs^i ■ (17) 

for all macro-state pairs (Z,s). The transition probabil- 
ities obtained by the sampling, however, need not 
fulfill the same condition. By the transformation 

transition probabilities q* with detailed balance are 
obtained. The transformation also serves to impose a 
known stationary distribution of concentrations on the 
transition probability matrix. 



VIII. CONCLUSION AND DISCUSSION 

When coarse-graining the state space of an energy 
landscape into macro-states, transition probabilities be- 
tween macro-states have to be obtained in order to cap- 
ture the coarse-grained stochastic dynamics. Here we 
have introduced a sampling method that allows for a fast 
yet accurate estimation of these transition probabilities. 
We have demonstrated the scalability of the approach 
with system size for special instances of the number par- 
titioning problem. As a real-world application, we have 
analyzed the folding landscape of the secondary structure 
of an RNA switch as an example of a biopolymer. Its rich 
dynamic behavior on separate fast and slow timescales is 
accurately rendered by transition probabilities obtained 
with low computational cost. 

The general method introduced here may serve as a 
ficxible framework for stochastic exploration of energy 
landscapes. As laid out in the Sec. IVIIl several exten- 
sions and modifications may be made to obtain increased 
performance and wider applicability. In particular, the 
high variation of macro-state sizes may be exploited in a 
scheme for an automatic choice of sampling effort. Fur- 
thermore, the merging of small macro-states with larger 
neighbors during the sampling may lead to more manage- 
able and potentially more meaningful partitions of the 
landscape akin to metabasins Q. 

In ongoing and future work, the method shall be ap- 
plied to other energy landscapes including those of state- 
discrete protein folding dynamics [45|-|47[. Such land- 
scapes have been shown to be amenable to sampling ap- 
proaches [13 . Another field of application of our method 
is the clarification of concepts for dynamics on energy sur- 
faces, such as the notion of a folding funnel [ill l49l Isoj). 
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